3D dense captioning and visual grounding both require a shared understanding of the underlying multimodal relationships. However, despite previous attempts to connect these two related tasks with highly task-specific neural modules, how to explicitly model their shared nature and learn them simultaneously remains understudied. In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D learns a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. With its generic architecture design, UniT3D allows the pre-training scope to be expanded to more diverse training sources, such as data synthesized from 2D prior knowledge, to benefit 3D vision-language tasks. Extensive experiments and analysis demonstrate that UniT3D obtains significant gains for 3D dense captioning and visual grounding.
Translated by Google Translate
Learning how to navigate among humans in an occluded and spatially constrained indoor environment is a key ability that an embodied agent needs in order to be integrated into our society. In this paper, we propose an end-to-end architecture that exploits Socially-Aware Tasks (referred to as Risk and Social Compass) to inject into a reinforcement learning navigation policy the ability to infer common-sense social behaviors. To this end, our tasks exploit the notion of immediate and future dangers of collision. Furthermore, we propose an evaluation protocol specifically designed for the Social Navigation Task in simulated environments. This protocol captures fine-grained features and characteristics of the policy by analyzing the minimal unit of human-robot spatial interaction, called an Encounter. We validate our approach on the Gibson4+ and Habitat-Matterport3D datasets.
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is the largest dataset of 3D real-world spaces with densely annotated semantics that is currently available to the academic community. It consists of 142,646 object instance annotations across 216 3D spaces and 3,100 rooms within those spaces. The scale, quality, and diversity of object annotations far exceed those of prior datasets. A key difference setting HM3DSEM apart from other datasets is the use of texture information to annotate pixel-accurate object boundaries. We demonstrate the effectiveness of the HM3DSEM dataset for the Object Goal Navigation task using different methods. Policies trained using HM3DSEM outperform those trained on prior datasets. The introduction of HM3DSEM in the Habitat ObjectNav Challenge led to an increase in participation from 400 submissions in 2021 to 1022 submissions in 2022.
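Human-object interactions with articulated objects are common in everyday life. Despite much progress in single-view 3D reconstruction, it remains challenging to infer an articulated 3D object model from an RGB video showing a person manipulating the object. We define the task of articulated 3D human-object interaction reconstruction from RGB video and carry out a systematic benchmark of five families of methods for this task: 3D plane estimation, 3D cuboid estimation, CAD model fitting, implicit field fitting, and free-form mesh fitting. Our experiments show that all methods struggle to obtain high-accuracy results, even when provided with ground-truth information about the observed objects. We identify the key factors that make the task challenging and suggest directions for future work on this challenging 3D computer vision task. Short video summary: https://www.youtube.com/watch?v=5talkbojzwc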
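Realistic 3D indoor scene datasets have enabled significant recent progress in computer vision, scene understanding, autonomous navigation, and 3D reconstruction. However, the scale, diversity, and customizability of existing datasets are limited, and scanning and annotating more scenes is time-consuming and expensive. Fortunately, combinatorics is on our side: existing 3D scene datasets contain enough individual rooms, if only there were a way to recombine them into new layouts. In this paper, we propose the task of generating novel 3D floor plans from existing 3D rooms. We identify three sub-tasks of this problem: generation of a 2D layout, retrieval of compatible 3D rooms, and deformation of 3D rooms to fit the layout. We then discuss different strategies for solving the problem and design two representative pipelines: one uses available 2D floor plans to guide the selection and deformation of 3D rooms; the other learns to retrieve a set of compatible 3D rooms and combine them into novel layouts. We design a set of metrics that evaluate the generated results with respect to each of the three sub-tasks and show that different methods trade off performance across these sub-tasks. Finally, we survey downstream tasks that benefit from generated 3D scenes and discuss how to select the methods best suited to the demands of these tasks.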
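Recent studies on 3D dense captioning and visual grounding have achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. In addition, how to discriminatively describe objects in complex 3D environments has not yet been fully studied. To address these challenges, we present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe, and discriminate. D3Net unifies 3D dense captioning and visual grounding in a self-critical manner. This self-critical property of D3Net also introduces discriminability during object caption generation and enables semi-supervised training on ScanNet data with partially annotated descriptions. Our method outperforms SOTA methods on both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin (23.56% CIDEr@0.5IoU improvement).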
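As a rough illustration of the 0.5 IoU threshold used in the captioning metric above, the sketch below computes the intersection-over-union of two 3D boxes and checks it against that threshold. It assumes axis-aligned boxes given by their minimum and maximum corners; the function names `box_iou_3d` and `is_match` are hypothetical, and the benchmark's actual evaluation code may represent boxes differently.

```python
from typing import Sequence


def box_volume(box_min: Sequence[float], box_max: Sequence[float]) -> float:
    """Volume of an axis-aligned box given its min and max corners."""
    volume = 1.0
    for lo, hi in zip(box_min, box_max):
        volume *= max(hi - lo, 0.0)
    return volume


def box_iou_3d(a_min, a_max, b_min, b_max) -> float:
    """IoU of two axis-aligned 3D boxes (assumed representation)."""
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # boxes do not overlap along this axis
        inter *= overlap
    union = box_volume(a_min, a_max) + box_volume(b_min, b_max) - inter
    return inter / union


def is_match(a_min, a_max, b_min, b_max, threshold: float = 0.5) -> bool:
    """True if the predicted box overlaps the reference box enough to count."""
    return box_iou_3d(a_min, a_max, b_min, b_max) >= threshold


# Two 2x2x2 boxes shifted by 1 along x: intersection 4, union 12.
iou = box_iou_3d((0, 0, 0), (2, 2, 2), (1, 0, 0), (3, 2, 2))
print(f"IoU = {iou:.3f}, counts at 0.5 threshold: {iou >= 0.5}")
```

Under a metric of this kind, a caption only contributes to the score when its predicted box passes the overlap test, so the shifted box in the example would not count.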
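We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack: data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g., cabinets and drawers that can open and close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850x real-time) on an 8-GPU node, representing a 100x speed-up over prior work; and (iii) the Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidying the house, preparing groceries, setting the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare large-scale reinforcement learning (RL) and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from "hand-off problems"; and (3) SPA pipelines are more brittle than RL policies.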
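The throughput figures quoted above can be related with a short back-of-the-envelope calculation. The sketch below uses only the numbers stated in the abstract; the assumption that throughput divides evenly across the 8 GPUs is ours, not the authors', and the variable names are illustrative.

```python
# Figures quoted in the abstract.
STEPS_PER_SECOND = 25_000   # simulation steps per wall-clock second
REAL_TIME_FACTOR = 850      # simulated seconds per wall-clock second
NUM_GPUS = 8                # GPUs in the node

# Steps needed to advance the simulation by one simulated second.
steps_per_simulated_second = STEPS_PER_SECOND / REAL_TIME_FACTOR

# Assumption: throughput is spread evenly across the GPUs.
steps_per_second_per_gpu = STEPS_PER_SECOND / NUM_GPUS

print(f"Steps per simulated second: {steps_per_simulated_second:.2f}")
print(f"Steps per second per GPU:   {steps_per_second_per_gpu:.0f}")
```

The first ratio comes out just under 30, which is consistent with a simulation stepped at roughly 30 Hz, although the abstract itself does not state the step rate.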
We present PartNet: a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. Our dataset consists of 573,585 part instances over 26,671 3D models covering 24 object categories. This dataset enables and serves as a catalyst for many tasks such as shape analysis, dynamic 3D scene modeling and simulation, affordance analysis, and others. Using our dataset, we establish three benchmarking tasks for evaluating 3D part recognition: fine-grained semantic segmentation, hierarchical semantic segmentation, and instance segmentation. We benchmark four state-of-the-art 3D deep learning algorithms for fine-grained semantic segmentation and three baseline methods for hierarchical semantic segmentation. We also propose a novel method for part instance segmentation and demonstrate its superior performance over existing methods.
Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
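The dataset counts above imply some simple ratios, worked out in the sketch below. Only the three totals come from the abstract; the per-scene figures are plain averages and say nothing about how views are actually distributed across buildings.

```python
# Totals quoted in the abstract.
NUM_IMAGES = 194_400     # RGB-D images
NUM_PANORAMAS = 10_800   # panoramic views
NUM_SCENES = 90          # building-scale scenes

# Each panorama is assembled from a fixed number of RGB-D images
# if the totals divide evenly, which they do here.
images_per_panorama = NUM_IMAGES // NUM_PANORAMAS
assert images_per_panorama * NUM_PANORAMAS == NUM_IMAGES

# Averages per scene (the real distribution is likely uneven).
panoramas_per_scene = NUM_PANORAMAS / NUM_SCENES
images_per_scene = NUM_IMAGES / NUM_SCENES

print(f"Images per panorama:       {images_per_panorama}")
print(f"Avg. panoramas per scene:  {panoramas_per_scene:.0f}")
print(f"Avg. RGB-D images per scene: {images_per_scene:.0f}")
```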